Group 24: The Clusters in our Stars

Name andrewID
Sivan Mehta skmehta
Patrick Chang pgchang
Joyce Sun joyces
Suvrath Penmetcha spenmetc

Data and Subgroup Description

Data source: The dataset was collected from 59,946 OKCupid users in San Francisco in June 2012. It contains demographic information, lifestyle information, and text responses to 10 essay questions. Each row of the dataset represents one profile of a user who opted into the survey.

The demographic information includes location, age, body type, ethnicity, height, and gender. The lifestyle and career information includes whether or not you have kids, sexual orientation, religion, smoking habits, diet, occupation, and income. While a few variables were measured continuously (age, height, and income), most were reported as categorical variables with qualifying modifiers. For example, in addition to reporting your religion, you could also report how much that answer really meant to you; one concrete example is "buddhism but not too serious about it".
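Assuming responses are coded as a base answer plus an optional connective modifier (the strings below are illustrative, not actual rows from the dataset), the base answer can be recovered with a single regular expression:

```r
# Strip the " and ... " / " but ... " modifier to recover the base
# answer (illustrative strings, not real rows from the dataset).
religion <- c("buddhism but not too serious about it",
              "atheism and laughing about it",
              "christianity")
base_faith <- sub(" (and|but) .*$", "", religion)
base_faith
# "buddhism" "atheism" "christianity"
```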

We chose the following subgroup: people under 30. To obtain this subset, we simply used the subset function:

sample <- subset(profiles, age < 30)

We are interested in exploring this subgroup because it represents the majority of OKCupid users and likely the most active users on the website.

Sentiment Analysis

Graph 1. Word Count vs. Sentiment

ggplot(subset(profiles, polarity < 2 & wc < 500)) + 
            geom_point(aes(x = polarity, y = wc)) + 
            ggtitle("Word Count vs. Sentiment")

In this graph, we plot the continuous variables polarity and wc (word count), both of which were generated during preprocessing. The wc variable was generated by splitting the essay0 variable on spaces and counting the resulting words:

profiles$wc <- sapply(strsplit(as.character(profiles$essay0), " "), length)

To generate the polarity value we took the existing dataset and preprocessed it using the following code:

library(qdap)
sample <- profiles[sample(nrow(profiles), 2000), ]
sample$polarity <- polarity(sample$essay0)$all$polarity

Because the polarity function from the qdap package takes a long time to run, we had to take a random sample of 2,000 profiles from the dataset. This function takes a string and returns the polarity of the text, that is, how positive or negative its sentiment is. The preprocessing for both polarity and wc was done in a preprocess.r file included with the final project.

We took a subset here simply to leave out what we saw as outliers. This "zooms" the graph in on the more interesting pieces and makes the group structure easier to see.

In this graph, we see wc on the vertical axis and polarity on the horizontal axis. The black coloring of the points and the size of the points are uniform and carry no meaning. An interesting feature that certainly stands out in this graph is the clear groups that form along lines that look like variations of a \(y=\frac{ 1 }{ x }\) curve. Upon further analysis, it would be interesting to see what these groups correspond to, as they do not match any of the categorical variables provided in the dataset.

Graph 2. Age vs. Income

wanted <- subset(profiles, income > 0 & income < 250000)
correlation <- cor(wanted$income, wanted$age, use = "na.or.complete")^2
mytext <- paste("Age vs. Income with", "r^2 =", round(correlation, 3), ", colored by sentiment")
        
ggplot(wanted) + 
    aes(x = age, y = income) + 
    geom_point(aes(colour = polarity)) + 
    scale_colour_gradientn(colours=c("#FF0000", "#ffffff", "#440088","#0000FF")) + 
    geom_smooth(method = "lm") + 
    ggtitle(mytext)

Using the preprocessed polarity variable, I then tried to explore the relationship between age and income. Additionally, I plotted a smoothing line to see if there was an overall trend that could be teased out. While one would presume that older people would earn more simply because they have advanced further in their careers, this hypothesis can only loosely be corroborated.

We can actually check a linear regression model with the following command:

summary(lm(wanted$age ~ wanted$income))
## 
## Call:
## lm(formula = wanted$age ~ wanted$income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.742  -6.211  -2.679   4.144  36.410 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.641e+01  8.590e-01  30.750  < 2e-16 ***
## wanted$income 1.089e-04  1.304e-05   8.347 1.18e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.242 on 397 degrees of freedom
## Multiple R-squared:  0.1493, Adjusted R-squared:  0.1471 
## F-statistic: 69.66 on 1 and 397 DF,  p-value: 1.179e-15

Because our p-value for the slope is < .05, we reject the null hypothesis that the slope is zero. However, just looking at the graph demonstrates that this model is quite inaccurate because of the large variation in the income variable. One limitation of this type of visualization is that while age was reported by year, income was not nearly as granular, offering only 13 distinct levels, with the vast majority not reporting at all, which is why we included income > 0 in our graphs and analysis. Additionally, we had to take the subset wanted simply to zoom in the graph, because some data points would have thrown off the visualization: people who did not report their income (hence income > 0) or whose reported income was an order of magnitude higher than everyone else's (hence income < 250000).

When including the polarity variable by coloring with a continuous color scale, we fail to see any discernible trends. There is no indication that sentiment has any relation to the age or the income of the profile. These claims are corroborated by the following graphs having relatively flat geom_smooth geometries, indicating that there is little correlation between polarity and either age or income.

base <- ggplot(wanted) +  # data
        geom_point() +    # scatterplot
        geom_smooth() +   # smoothing line
        aes(y = polarity) # polarity on the y-axis

base + aes(x = age)

base + aes(x = income)

Created, presented, and described by Sivan Mehta

Plot Two Categorical

library(ggplot2)
library(plotly)

plotTwoCategorical <- function(var1, var2, df = profiles){
  var1.c = c()
  var2.c = c()
  
  # data frame creation
  var1.list = unique(df[[var1]])
  var2.list = unique(df[[var2]])
  for(i in unique(df[[var1]])){
    var1.c = c(var1.c, rep(i,length(unique(df[[var2]]))))
  }
  var2.c = rep(var2.list,length(var1.list))
  temp.df = as.data.frame(as.character(var1.c))
  colnames(temp.df) = "var.1"
  temp.df$var.2 = as.character(var2.c)
  temp.df$var.1 = as.character(temp.df$var.1)
  temp.df$count = rep(0, nrow(temp.df))
  temp.df$freq = rep(0,nrow(temp.df))
  for(i in 1:nrow(temp.df)){
    temp.df$count[i] = nrow(subset(df, df[[var1]] == temp.df$var.1[i] & 
                                     df[[var2]] == temp.df$var.2[i]))
  }
  for(i in 1:nrow(temp.df)){
    temp.df$freq[i] = temp.df$count[i] / sum(temp.df$count[which(temp.df$var.1 == temp.df$var.1[i])]) 
  }
  
  plot1 = ggplot(temp.df) + geom_point(aes(x=var.1,y=var.2, size = freq)) +
          scale_size(range = c(0, 25)) +
            ggtitle(paste("Proportion Breakdown of ",gsub("_", " ", var1),
                          " and ", gsub("_", " ", var2), sep="")) + 
            labs(x=gsub("_", " ", var1),y=gsub("_", " ", var2))
  ggplotly(plot1)  # pass the plot explicitly; plot1 was never printed
}

plotTwoCategorical("body_type","drinks",profiles)

This allows for an easy and early analysis of different relationships that we may want to examine. For example, looking at drinking habits by body type yields two interesting relationships: a higher proportion of desperate drinkers are jacked or used up than any other body type. Further, people who consider themselves above average in weight (full figured, a little extra) tend to drink less often (a higher proportion of them report not drinking at all).

The graph is used as a tool to show the conditional distribution of one categorical variable given another. It allows for a simple comparison along the x-axis, where the proportions for each x-axis level sum to 1. This is effectively a stacked bar chart for proportions, but since each dot is spaced out equally, we can compare across the y-axis too. This was done by creating a data frame with one row per combination of the two variables' levels, then counting the occurrences of each intersection. The only manipulation of the data required is to convert all the factor-level data into strings via as.character.
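The counting loops above can be cross-checked with base R's table() and prop.table(), which compute the same row-normalized proportions in one step (the toy columns below stand in for the OKCupid variables):

```r
# Cross-check of the row-normalized proportions: prop.table with
# margin = 1 makes each level of the first variable sum to 1.
df <- data.frame(body_type = c("fit", "fit", "curvy", "curvy", "curvy"),
                 drinks    = c("often", "rarely", "rarely", "rarely", "often"),
                 stringsAsFactors = FALSE)
freq <- prop.table(table(df$body_type, df$drinks), margin = 1)
freq
# e.g. freq["curvy", "rarely"] is 2/3
```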

Created, presented, and described by Patrick Chang

Word Clouds

library(tm)
library(wordcloud)

getTraitsCloud <- function(dataset){
  corpus <- Corpus(VectorSource(dataset$essay9))
  plaintext <- tm_map(corpus, PlainTextDocument)
  crude_plain <- tm_map(plaintext, removePunctuation)
  crude_plain <- tm_map(crude_plain, removeWords, stopwords('english'))
  crude_plain <- tm_map(crude_plain, stemDocument)
  wordcloud(crude_plain, max.words = 100, random.order = FALSE)
}


getTraitsCloud(profiles[which(profiles$body_type == "fit"),])

This looks at a word cloud of the traits fit people desire in their partners (done by parsing essay9). While there are a few words that do not correspond to traits, we see that most fit people in our sample want partners who are honest, fun, humorous, genuine, active, and adventurous. While not very telling by itself, this lends more information by telling us what traits users of each body type are looking for in potential partners.

This was created with the wordcloud package. We converted essay9 ("you should message me if") into a plaintext corpus, then parsed it to remove punctuation and common English stopwords. The function then returns a word cloud of the 100 most frequent words, since there are multiple uninformative words that are not coded as stopwords. No further manipulation is required for generating this graphic.

Created, presented, and described by Patrick Chang

Lifestyle vs. Age

ggplot(data = profiles, aes(x = drinks)) +
      geom_violin(aes(y = age)) + 
      ggtitle("Violin Plot of Age vs. Drinking Habits") +
      theme(axis.text.x = element_text(angle = 90))

From this graphic, we can see that the distribution of age varies depending on drinking habits. Users who drink desperately tend to be young (under 30) or old (over 50), while the number of users who drink never, rarely, or socially decreases as age increases. In the Shiny app, the variable on the x-axis can be changed to examine relationships between age and smoking habits, diet, and body type as well.

This graph was created using ggplot() for the base, geom_violin() for the plot type, and theme() to rotate the axis labels. In the Shiny code, the ui script offers drop-down input options for the x-variable, with the default set to drinks. The server script reads the user's selection via aes_string() and plots that variable on the x-axis.
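A minimal sketch of that ui/server wiring (the input id "xvar" and the exact choice list are assumptions, not the project's actual Shiny code, and `profiles` is expected to be loaded already):

```r
library(shiny)
library(ggplot2)

# Sketch of the app described above; "xvar" and the choices are assumptions.
ui <- fluidPage(
  selectInput("xvar", "X-axis variable:",
              choices = c("drinks", "smokes", "diet", "body_type"),
              selected = "drinks"),
  plotOutput("violin")
)

server <- function(input, output) {
  output$violin <- renderPlot({
    # aes_string() lets the column name arrive as a string from the input
    ggplot(profiles, aes_string(x = input$xvar, y = "age")) +
      geom_violin() +
      theme(axis.text.x = element_text(angle = 90))
  })
}

# shinyApp(ui, server)
```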

Created, presented, and described by Joyce Sun

Traits Offered vs. Demanded

ggplot(data = profiles, aes(x = n_traits, y = n_demands)) +
      geom_point(aes(col = age)) +
      ggtitle("Traits Offered vs Demanded from Partner") + 
      xlim(0,100) + ylim(0,40)

This graph shows the number of traits a user looks for in a partner vs. the number of traits the user offers about him/herself. The majority of users in our subset offer fewer than 25 traits about themselves and ask for fewer than 10 traits from a potential match. Some users offer or demand more, including 5 potential outliers with greater than 100 traits offered or greater than 40 traits demanded, who were removed for better visibility. In the Shiny app, the observations can be colored according to various lifestyle variables: age, drinking habits, smoking habits, diet, and body type. Generally, there do not appear to be significant patterns in the distribution of traits offered or demanded across the lifestyle variables.

In order to estimate the number of traits offered and demanded from matches, we manipulated the essay data in the original dataset. The number of traits offered is the number of clauses in each user's essay0 ("About me"), cut by periods, commas, and "and"s. The number of traits demanded is the number of clauses in each user's essay9 ("You should message me if"), cut according to the same criteria. The graph was created using ggplot() for the base, geom_point() for the scatterplot points, and aes_string() to identify the variable to color by. Finally, we used xlim() and ylim() to zoom into the scatterplot; this cut out 5 potential outliers but made the remaining observations much more readable.
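The clause counting described above can be sketched as follows (the exact delimiters and cleanup in the project's preprocessing may differ, so treat this as an approximation):

```r
# Count clauses by splitting on periods, commas, and the word "and",
# then keeping only non-empty pieces (approximation of the project's
# actual preprocessing).
count_clauses <- function(text) {
  pieces <- strsplit(text, "\\.|,| and ")[[1]]
  sum(nchar(trimws(pieces)) > 0)
}

count_clauses("I am honest, fun and adventurous. I like hiking.")
# → 4
```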

Created, presented, and described by Joyce Sun

Maps

library(ggmap)
locations_under_30 <- read.csv("locations_under_30.csv")
profiles <- subset(profiles, smokes != "" & drugs != "" & drinks != "")
under_30 <- subset(profiles, age < 30)
under_30 <- merge(under_30, locations_under_30, "location")
map <- get_map(location = c(lon = -122.286884, lat = 37.670770), zoom = 10)
ggmap(map) + 
    ggtitle("OKCupid Users in California Under 30") + 
    geom_point(aes(x = lon, 
                   y = lat, 
                   size = number_of_people, 
                   color = smokes), 
                data = under_30)  # the merged frame holds lon/lat/counts

This map displays where OKCupid users are located around San Francisco and how many users are located in each area. We can see there are large numbers of OKCupid users in San Francisco and also in Oakland and Berkeley. Furthermore, we can look at categorical variables such as drinking to compare the drinking habits of various OKCupid populations across San Francisco. For example, we see that the majority of OKCupid users in Daly City rarely drink.

To make this map, I used ggmap. I subsetted our under-30 population to find all the unique locations, then created another CSV file containing each unique location, its longitude and latitude, and the number of people at that location. I used the geocode function to find the latitude and longitude, and I cached the results in a CSV file because Google Maps limits the number of queries you can make.
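A hedged sketch of that one-time geocoding pass (the toy location strings stand in for profiles$location, the column names are assumptions, and the geocode call is left commented out because it requires a Google Maps API key):

```r
# Count users at each unique location string, then geocode each
# location once and cache the results to a CSV for later sessions.
location <- c("berkeley, california", "oakland, california",
              "berkeley, california", "daly city, california")
counts <- as.data.frame(table(location), stringsAsFactors = FALSE)
names(counts) <- c("location", "number_of_people")

# One rate-limited query per unique location (needs an API key):
# library(ggmap)
# coords <- geocode(counts$location)   # data frame with lon, lat columns
# write.csv(cbind(counts, coords), "locations_under_30.csv", row.names = FALSE)

counts
```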

Created, presented, and described by Suvrath Penmetcha

Mosaic Plot

tab <- table(profiles$drugs, profiles$smokes)
mosaicplot(tab, xlab = "Drugs", ylab = "Smokes",
           main = "Mosaic Plot of Drugs vs. Smokes",
           shade = TRUE, las = 1)

This mosaic plot shows the association between the drug and smoking habits of OKCupid users. The shading of the plot indicates an association between people's drug and smoking habits. For example, we can see that users who don't smoke almost never do drugs. Likewise, people who often do drugs also smoke.

To make this mosaic plot, I used the mosaicplot function in R on a table of the two categorical variables of interest, drug and smoking habits. I also had to remove any observations where users didn't provide an answer for their drug or smoking habits.

Created, presented, and described by Suvrath Penmetcha

Conclusions

Across our subgroup of users under 30, a few themes emerge. Sentiment, as measured by the qdap polarity of essay0, shows little relationship to age or income, though word count and polarity form striking, unexplained groups along 1/x-shaped curves. Age and income are positively but only weakly related (r^2 of about 0.15), with much of the signal lost to the coarse, 13-level income variable. Lifestyle variables are more informative: drinking habits vary with both body type and age, drug and smoking habits are strongly associated, and users cluster geographically in San Francisco, Oakland, and Berkeley. The essay-derived trait counts suggest most users describe themselves in more detail than they demand of a match.

With more time, we would dig further into the curve-shaped clusters in the word count vs. sentiment plot, since they do not correspond to any categorical variable in the dataset, and revisit the age-income relationship with a more granular income measure or with the non-reporters modeled explicitly.